import pandas as pd #importing libraries
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
import plotly.graph_objects as go
import plotly.offline as pyo #for the USA plot to be shown offline
pyo.init_notebook_mode()
Here we can check all the columns/"variables" that are available for exploration, and start to sketch a path to follow.
- We can observe that the dataframe has 9800 rows and 18 columns.
df = pd.read_csv("train.csv") #read and overview of database
display(df)
| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | State | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2017-152156 | 08/11/2017 | 11/11/2017 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | 42420.0 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.9600 |
| 1 | 2 | CA-2017-152156 | 08/11/2017 | 11/11/2017 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | 42420.0 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.9400 |
| 2 | 3 | CA-2017-138688 | 12/06/2017 | 16/06/2017 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | California | 90036.0 | West | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters b... | 14.6200 |
| 3 | 4 | US-2016-108966 | 11/10/2016 | 18/10/2016 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | Florida | 33311.0 | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 |
| 4 | 5 | US-2016-108966 | 11/10/2016 | 18/10/2016 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | Florida | 33311.0 | South | OFF-ST-10000760 | Office Supplies | Storage | Eldon Fold 'N Roll Cart System | 22.3680 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9795 | 9796 | CA-2017-125920 | 21/05/2017 | 28/05/2017 | Standard Class | SH-19975 | Sally Hughsby | Corporate | United States | Chicago | Illinois | 60610.0 | Central | OFF-BI-10003429 | Office Supplies | Binders | Cardinal HOLDit! Binder Insert Strips,Extra St... | 3.7980 |
| 9796 | 9797 | CA-2016-128608 | 12/01/2016 | 17/01/2016 | Standard Class | CS-12490 | Cindy Schnelling | Corporate | United States | Toledo | Ohio | 43615.0 | East | OFF-AR-10001374 | Office Supplies | Art | BIC Brite Liner Highlighters, Chisel Tip | 10.3680 |
| 9797 | 9798 | CA-2016-128608 | 12/01/2016 | 17/01/2016 | Standard Class | CS-12490 | Cindy Schnelling | Corporate | United States | Toledo | Ohio | 43615.0 | East | TEC-PH-10004977 | Technology | Phones | GE 30524EE4 | 235.1880 |
| 9798 | 9799 | CA-2016-128608 | 12/01/2016 | 17/01/2016 | Standard Class | CS-12490 | Cindy Schnelling | Corporate | United States | Toledo | Ohio | 43615.0 | East | TEC-PH-10000912 | Technology | Phones | Anker 24W Portable Micro USB Car Charger | 26.3760 |
| 9799 | 9800 | CA-2016-128608 | 12/01/2016 | 17/01/2016 | Standard Class | CS-12490 | Cindy Schnelling | Corporate | United States | Toledo | Ohio | 43615.0 | East | TEC-AC-10000487 | Technology | Accessories | SanDisk Cruzer 4 GB USB Flash Drive | 10.3840 |
9800 rows × 18 columns
- Since all 11 null values are from the city of Burlington, Vermont, I am going to replace them with a valid Postal Code for that city (5401);
- also, both date columns were detected as strings; we want to convert them to datetime.
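A minimal sketch of this cleaning plan on a tiny stand-in frame (an explicit date format is used here as an alternative to `dayfirst=True`; the values are illustrative, not the full `train.csv`):

```python
import pandas as pd

# Tiny stand-in frame mimicking the relevant columns of train.csv
sample = pd.DataFrame({
    "Order Date": ["08/11/2017", "12/06/2017"],
    "Ship Date": ["11/11/2017", "16/06/2017"],
    "Postal Code": [42420.0, None],
})

# Dates are day-first strings; an explicit format avoids ambiguity
sample["Order Date"] = pd.to_datetime(sample["Order Date"], format="%d/%m/%Y")
sample["Ship Date"] = pd.to_datetime(sample["Ship Date"], format="%d/%m/%Y")

# Replace the missing Postal Code with Burlington, VT's 5401
sample["Postal Code"] = sample["Postal Code"].fillna(5401.0)

print(sample.dtypes["Order Date"])          # datetime64[ns]
print(sample["Postal Code"].isna().sum())   # 0
```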
df.info() #info to check for null values
df2 = df[df["Postal Code"].isna()] #filtering the 11 null values - present only in Postal Code column
display(df2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Row ID         9800 non-null   int64
 1   Order ID       9800 non-null   object
 2   Order Date     9800 non-null   object
 3   Ship Date      9800 non-null   object
 4   Ship Mode      9800 non-null   object
 5   Customer ID    9800 non-null   object
 6   Customer Name  9800 non-null   object
 7   Segment        9800 non-null   object
 8   Country        9800 non-null   object
 9   City           9800 non-null   object
 10  State          9800 non-null   object
 11  Postal Code    9789 non-null   float64
 12  Region         9800 non-null   object
 13  Product ID     9800 non-null   object
 14  Category       9800 non-null   object
 15  Sub-Category   9800 non-null   object
 16  Product Name   9800 non-null   object
 17  Sales          9800 non-null   float64
dtypes: float64(2), int64(1), object(15)
memory usage: 1.3+ MB
| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | State | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2234 | 2235 | CA-2018-104066 | 05/12/2018 | 10/12/2018 | Standard Class | QJ-19255 | Quincy Jones | Corporate | United States | Burlington | Vermont | NaN | East | TEC-AC-10001013 | Technology | Accessories | Logitech ClearChat Comfort/USB Headset H390 | 205.03 |
| 5274 | 5275 | CA-2016-162887 | 07/11/2016 | 09/11/2016 | Second Class | SV-20785 | Stewart Visinsky | Consumer | United States | Burlington | Vermont | NaN | East | FUR-CH-10000595 | Furniture | Chairs | Safco Contoured Stacking Chairs | 715.20 |
| 8798 | 8799 | US-2017-150140 | 06/04/2017 | 10/04/2017 | Standard Class | VM-21685 | Valerie Mitchum | Home Office | United States | Burlington | Vermont | NaN | East | TEC-PH-10002555 | Technology | Phones | Nortel Meridian M5316 Digital phone | 1294.75 |
| 9146 | 9147 | US-2017-165505 | 23/01/2017 | 27/01/2017 | Standard Class | CB-12535 | Claudia Bergmann | Corporate | United States | Burlington | Vermont | NaN | East | TEC-AC-10002926 | Technology | Accessories | Logitech Wireless Marathon Mouse M705 | 99.98 |
| 9147 | 9148 | US-2017-165505 | 23/01/2017 | 27/01/2017 | Standard Class | CB-12535 | Claudia Bergmann | Corporate | United States | Burlington | Vermont | NaN | East | OFF-AR-10003477 | Office Supplies | Art | 4009 Highlighters | 8.04 |
| 9148 | 9149 | US-2017-165505 | 23/01/2017 | 27/01/2017 | Standard Class | CB-12535 | Claudia Bergmann | Corporate | United States | Burlington | Vermont | NaN | East | OFF-ST-10001526 | Office Supplies | Storage | Iceberg Mobile Mega Data/Printer Cart | 1564.29 |
| 9386 | 9387 | US-2018-127292 | 19/01/2018 | 23/01/2018 | Standard Class | RM-19375 | Raymond Messe | Consumer | United States | Burlington | Vermont | NaN | East | OFF-PA-10000157 | Office Supplies | Paper | Xerox 191 | 79.92 |
| 9387 | 9388 | US-2018-127292 | 19/01/2018 | 23/01/2018 | Standard Class | RM-19375 | Raymond Messe | Consumer | United States | Burlington | Vermont | NaN | East | OFF-PA-10001970 | Office Supplies | Paper | Xerox 1881 | 12.28 |
| 9388 | 9389 | US-2018-127292 | 19/01/2018 | 23/01/2018 | Standard Class | RM-19375 | Raymond Messe | Consumer | United States | Burlington | Vermont | NaN | East | OFF-AP-10000828 | Office Supplies | Appliances | Avanti 4.4 Cu. Ft. Refrigerator | 542.94 |
| 9389 | 9390 | US-2018-127292 | 19/01/2018 | 23/01/2018 | Standard Class | RM-19375 | Raymond Messe | Consumer | United States | Burlington | Vermont | NaN | East | OFF-EN-10001509 | Office Supplies | Envelopes | Poly String Tie Envelopes | 2.04 |
| 9741 | 9742 | CA-2016-117086 | 08/11/2016 | 12/11/2016 | Standard Class | QJ-19255 | Quincy Jones | Corporate | United States | Burlington | Vermont | NaN | East | FUR-BO-10004834 | Furniture | Bookcases | Riverside Palais Royal Lawyers Bookcase, Royal... | 4404.90 |
#since all 11 null values are from Burlington, Vermont - replacing them with a valid Postal Code (5401)
df3 = df[(df["Postal Code"] >= 5401.0) & (df['Postal Code'] <=5408.0)]
display(df3) #verifying whether there's any Burlington/Vermont Postal Code already in the database - there isn't
| Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | State | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 18 columns
df.loc[df["Postal Code"].isnull(), "Postal Code"] = 5401.0 #replacing all 11 null Postal Code values with float 5401.0
df3 = df[(df["Postal Code"] >= 5401.0) & (df['Postal Code'] <= 5408.0)] #verifying again - the 11 rows now appear with Postal Code 5401.0
df3.head(2)
| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | State | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2234 | 2235 | CA-2018-104066 | 05/12/2018 | 10/12/2018 | Standard Class | QJ-19255 | Quincy Jones | Corporate | United States | Burlington | Vermont | 5401.0 | East | TEC-AC-10001013 | Technology | Accessories | Logitech ClearChat Comfort/USB Headset H390 | 205.03 |
| 5274 | 5275 | CA-2016-162887 | 07/11/2016 | 09/11/2016 | Second Class | SV-20785 | Stewart Visinsky | Consumer | United States | Burlington | Vermont | 5401.0 | East | FUR-CH-10000595 | Furniture | Chairs | Safco Contoured Stacking Chairs | 715.20 |
- I created two new columns so I can analyze the period from different "frames" (years or months);
- I also created a "DaystoShip" column, with the number of days between the Order Date and the Ship Date.
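The three derived columns can be sketched on a toy frame (same logic, minus the real dataset):

```python
import pandas as pd

# Toy frame with already-parsed dates standing in for df
orders = pd.DataFrame({
    "Order Date": pd.to_datetime(["2017-11-08", "2016-10-11"]),
    "Ship Date": pd.to_datetime(["2017-11-11", "2016-10-18"]),
})

orders["Year"] = orders["Order Date"].dt.year
# YYYYMM integer, e.g. 201711; dt.to_period("M") would be an alternative
orders["Year Month"] = orders["Order Date"].dt.year * 100 + orders["Order Date"].dt.month
# Days elapsed between order and shipping
orders["DaystoShip"] = (orders["Ship Date"] - orders["Order Date"]).dt.days

print(orders[["Year", "Year Month", "DaystoShip"]])
```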
#treating Dates
df["Order Date"] = pd.to_datetime(df["Order Date"], dayfirst=True) #now it's a datetime column
df["Ship Date"] = pd.to_datetime(df["Ship Date"], dayfirst=True)
df["Year"] = df["Order Date"].dt.year #creating a new column with the year to take a first look at sales by different time periods.
df["Year Month"] = df["Order Date"].dt.year*100 + df["Order Date"].dt.month #creating a new column with only year and month (e.g. 201711).
#Just in case I want to group sales by month. Strings can work better on graph axes in some cases.
df["Year Month"] = df["Year Month"].map(str)
#creating a new column with the number of days it took to ship the product.
df["DaystoShip"] = (df["Ship Date"] - df["Order Date"])
df["DaystoShip"] = df["DaystoShip"].dt.days
df.head(2)
| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Year | Year Month | DaystoShip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2017-152156 | 2017-11-08 | 2017-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420.0 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 | 2017 | 201711 | 3 |
| 1 | 2 | CA-2017-152156 | 2017-11-08 | 2017-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420.0 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.94 | 2017 | 201711 | 3 |
2 rows × 21 columns
- I am going to remove the columns I believe won't help the analysis: "Row ID", "Order ID", "Customer Name", "Country", "Postal Code" and "Product ID".
#removing columns I believe are not going to help.
display(df["Country"].value_counts()) #only USA orders.
df = df.drop(["Row ID", "Order ID", "Customer Name", "Country", "Postal Code", "Product ID"], axis=1)
df.head(2)
United States    9800
Name: Country, dtype: int64
| | Order Date | Ship Date | Ship Mode | Customer ID | Segment | City | State | Region | Category | Sub-Category | Product Name | Sales | Year | Year Month | DaystoShip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-11-08 | 2017-11-11 | Second Class | CG-12520 | Consumer | Henderson | Kentucky | South | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 | 2017 | 201711 | 3 |
| 1 | 2017-11-08 | 2017-11-11 | Second Class | CG-12520 | Consumer | Henderson | Kentucky | South | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.94 | 2017 | 201711 | 3 |
- Total sales for the whole period were 2,261,536.78 dollars;
- the largest single order in the period was 22,638.48 dollars.
*We can also start to look at the sales by Year.
print("Sales statistics at the whole period analyzed:\n")
print(df["Sales"].describe()) #Sales statistics info
print("-"*30)
print("Period Total Sales:\n")
print(df["Sales"].sum())
print("-"*30)
print("Sales per Year:")
df5 = df[["Year", "Sales"]]
display(df5.groupby("Year")["Sales"].sum().sort_values(ascending=False)) #4 years of orders - 2015~2018
Sales statistics at the whole period analyzed:

count     9800.000000
mean       230.769059
std        626.651875
min          0.444000
25%         17.248000
50%         54.490000
75%        210.605000
max      22638.480000
Name: Sales, dtype: float64
------------------------------
Period Total Sales:

2261536.7827000003
------------------------------
Sales per Year:
Year
2018    722052.0192
2017    600192.5500
2015    479856.2081
2016    459436.0054
Name: Sales, dtype: float64
#A more basic plot.
display(df["Year"].value_counts()) #the number of orders rose every year.
df4 = df["Year"].value_counts()
x = df4.index
y = df4.values
plt.bar(x, y)
plt.xticks(x)
plt.show()
2018    3258
2017    2534
2016    2055
2015    1953
Name: Year, dtype: int64
*You can observe that this plot already gives us the percentage difference in sales between each year and the previous one.
- Sales decreased 4.26% in 2016, but increased 30.64% and 20.30% in 2017 and 2018, respectively;
- 2018 was the best year in sales, followed by 2017.
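The year-over-year percentages quoted above can be reproduced from the yearly totals (taken from the "Sales per Year" output) with pandas' `pct_change`; a minimal sketch:

```python
import pandas as pd

# Yearly totals from the groupby output above
yearly = pd.Series(
    {2015: 479856.2081, 2016: 459436.0054, 2017: 600192.5500, 2018: 722052.0192},
    name="Sales",
).sort_index()

# Fractional change vs. the previous year, as a percentage;
# the first year has no predecessor, so it is NaN
growth = yearly.pct_change() * 100
print(growth.round(2))  # 2016: -4.26, 2017: 30.64, 2018: 20.30
```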
#Now a more refined plot.
df6 = df.groupby("Year")["Sales"].sum()
fig, ax = plt.subplots(figsize=(8,5))
ax.bar(df6.index, df6.values, width = 0.7, color='#84ba5b', label="Sales by Year - in dollars\nRise or Fall in Sales (%) compared w/ Previous Year")
plt.xticks(ticks=[2015,2016,2017,2018])
ax.legend()
for i in range(len(df6)):
ax.annotate(df6.values[i].astype(int), (df6.index[i],df6.values[i]+5000), ha="center") #values over bars
if df6.index[i] == 2015:
pass
else:
percent = (df6.values[i]/df6.values[i-1]) - 1
ax.annotate((f"{percent:.2%}"), (df6.index[i],df6.values[i]-40000), ha="center") #difference percentage inside bars
#this plot already shows the percentage difference in sales between each year and the previous one.
plt.show()
- The best month in sales was November 2018;
- every year the best month was November or December, except in 2015 (September);
*We can start to observe that there may be a trend of growth in sales after 2016, and a similar seasonality in sales over each year.
- The number of months where sales were higher than the whole-period monthly median decreased from 4 to 3 between 2015 and 2016. After that, it increased to 7 and 10 in 2017 and 2018, respectively. So, in 2018 sales were higher than this median almost the entire year, and in November sales were more than 3 times this median value.
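The "months above the period median" counts can be computed by grouping the monthly totals by year; a minimal sketch with made-up monthly figures (the notebook itself uses df8 and salesmonth_median):

```python
import pandas as pd

# Made-up monthly totals keyed by "YYYYMM" strings, two years only
monthly = pd.Series(
    {"201501": 10.0, "201502": 60.0, "201503": 20.0,
     "201601": 70.0, "201602": 80.0, "201603": 30.0}
)

median = monthly.median()              # period-wide median of monthly sales
year = monthly.index.str[:4]           # extract the year from each key
above = monthly.gt(median).groupby(year).sum()  # months above the median, per year
print(above)
```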
#Now a look into the sales by month
df7 = df[["Year Month", "Sales"]] #new DF w/ only the info i need now.
df8 = df7.groupby("Year Month")["Sales"].sum() #new DF - groupby year month, sum sales value
salesmonth_mean = df8.values.mean() #mean value
df9 = [[salesmonth_mean] * len(df8)] #nested list just to plot the mean line on the graph.
salesmonth_median = df8.median() #median value
df17 = [[salesmonth_median] * len(df8)] #nested list just to plot the median line on the graph.
#Plotting
fig, ax = plt.subplots(figsize=(12,6.5))
ax.bar(df8.index[0:12], df8.values[0:12], color='#a8d586', label="Sales by Month - 2015")
ax.bar(df8.index[12:24], df8.values[12:24], color='#f1b499', label="Lower Sales Year - 2016")
ax.bar(df8.index[24:36], df8.values[24:36], color='#5f9833', label="Better Sales Year - 2017")
ax.bar(df8.index[36:48], df8.values[36:48], color='#84ba5b', label="Even better Sales Year - 2018 (w/ lower growth rate)")
ax.plot(df8.index, df9[0], linewidth=2.5, label="Mean - in dollars") #mean over the plot
ax.plot(df8.index, df17[0], linewidth=2.5, label="Median - in dollars") #median over the plot
ax.annotate(int(df8.values[8]), (df8.index[8], df8.values[8] + 1000), ha = "center") #sales best month for each year
ax.annotate(int(df8.values[22]), (df8.index[22], df8.values[22] + 1000), ha = "center")
ax.annotate(int(df8.values[35]), (df8.index[35], df8.values[35] + 1000), ha = "center")
ax.annotate(int(df8.values[46]), (df8.index[46], df8.values[46] + 1000), ha = "center")
ax.annotate(int(salesmonth_mean), (df8.index[0], salesmonth_mean - 4000), ha="center") #mean value in plot
ax.annotate(int(salesmonth_median), (df8.index[0], salesmonth_median - 4000), ha="center") #median value in plot
# Rotating x tick labels so they don't overlap
ax.tick_params('x',labelrotation=82)
# Adding legend
ax.legend();
#We have ship mode and DaystoShip to explore
df18 = df["Ship Mode"].value_counts()
x = df18.index
y = df18.values
plt.bar(x, y, color="#FFB857", label="Total Orders by Ship Mode")
plt.xticks(x)
for i in range(len(x)):
plt.annotate(df18.values[i], (df18.index[i], df18.values[i]-300), ha="center")
plt.legend()
#apparently the pattern observed here is the expected one, with most people ordering by Standard Class
plt.subplots()
df10 = df[["Ship Mode", "DaystoShip", "Sales"]]
df11 = df10.groupby("Ship Mode")["Sales"].sum().astype(int) #sales sum by ship mode - nothing calls attention
x = df11.index
y = df11.values
plt.bar(x, y, color="orange", label="Sales by Ship Mode,\n in dollars")
plt.xticks(x)
for i in range(len(x)):
plt.annotate(df11.values[i], (df11.index[i], df11.values[i]-70000), ha="center")
plt.legend()
plt.show()
- Max Days to Ship is seven - not bad. But could it be better for First, Second and Standard Class?
df12 = df10.groupby("Ship Mode")["DaystoShip"].mean().astype(int).sort_values() #that's the only point where I observe something noteworthy
df19 = [[df10["DaystoShip"].max()] * len(df12)] #nested list just to plot the max on the graph.
x = df12.index.tolist()
y = df12.values.astype(int)
fig, ax = plt.subplots(figsize=(10,5))
ax.bar(x, y, color='#7dfb7d', label="Days to Ship - Mean")
ax.hlines(df10["DaystoShip"].max(), -0.5, 3.5, label="Days to Ship - Max", color = "r") #max over the plot
#Max Days to Ship is seven - not bad. But could it be better for most classes?
plt.yticks(ticks=[0,1,2,3,4,5,6,7,8])
for i in range(len(x)):
ax.annotate(y[i], (x[i], y[i]+0.2), ha="center") #mean value over the bars
ax.legend(loc=(6));
- The segment that orders the most is Consumer;
- the segment with the highest sales is Consumer too.
#We have segment
df20 = df["Segment"].value_counts()
x = df20.index
y = df20.values
plt.bar(x, y, color="#F0A2FC", label="Total Orders by Segment")
plt.xticks(x)
for i in range(len(x)):
plt.annotate(df20.values[i], (df20.index[i], df20.values[i]-300), ha="center")
plt.legend()
#only 3 segments - Consumer, Corporate and Home Office
plt.subplots()
df13 = df[["Segment", "Customer ID", "Region", "City", "State", "Sales"]]
df14 = df13.groupby("Segment")["Sales"].sum().astype(int) #Consumer is the most important segment for sales.
x = df14.index
y = df14.values
plt.bar(x, y, color="#EA7DFB", label="Sales by Segment,\n in dollars")
plt.xticks(x)
for i in range(len(x)):
plt.annotate(df14.values[i], (df14.index[i], df14.values[i]-70000), ha="center")
plt.legend()
plt.show()
- We got 793 different customers;
- only 6 customers are "single-time buyers";
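The "single-time buyers" count boils down to filtering `value_counts`; a minimal sketch with hypothetical customer IDs (in the notebook this runs on df["Customer ID"]):

```python
import pandas as pd

# Hypothetical customer IDs; each row is one order line
customers = pd.Series(["WB-21850", "WB-21850", "MA-17560", "AO-10810", "MA-17560"])

counts = customers.value_counts()          # orders per customer
single_buyers = (counts == 1).sum()        # customers appearing exactly once
print(counts.size, single_buyers)
```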
#We have Customer ID
print("Orders by Customer")
display(df["Customer ID"].value_counts()) #we got 793 different customers
print("-"*40)
print("Less orders Customers\n")
print(df["Customer ID"].value_counts().tail(10)) #only 6 customers are "single-time buyers"
Orders by Customer
WB-21850 35
MA-17560 34
PP-18955 34
JL-15835 33
CK-12205 32
..
JR-15700 1
CJ-11875 1
SC-20845 1
RE-19405 1
AO-10810 1
Name: Customer ID, Length: 793, dtype: int64
----------------------------------------
Less orders Customers

AR-10570    2
IM-15055    2
TS-21085    2
PB-19210    2
LD-16855    1
JR-15700    1
CJ-11875    1
SC-20845    1
RE-19405    1
AO-10810    1
Name: Customer ID, dtype: int64
- Top 10 Customers by Sales - SM-20320 is the customer with the biggest revenue by a good margin from the second.
df21 = df.groupby(["Customer ID"])["Sales"].sum().astype(int).sort_values(ascending=False).head(10)
#Top 10 Customers by Sales - SM-20320 is the customer with the biggest revenue by a good margin from the second.
fig, ax = plt.subplots(figsize=(10,6))
ax.barh(df21.index, df21.values, color='#F5EA14', edgecolor = "black", label="Sales by Customers,\n in dollars")
ax.invert_yaxis()
for i in range(len(df21)):
ax.annotate(df21.values[i], (df21.values[i]-900, df21.index[i]), ha="center")
ax.legend(loc=7);
- West and East are the best regions, respectively, in number of orders and in sales;
- Central and South have the fewest orders - South represents only 16% of the orders;
- and they are also the two lowest-sales regions.
*Maybe we could increase marketing there?
#and we have location data - country Region, City, and State.
df16 = df["Region"].value_counts()
plt.title("Region Orders", fontsize=14)
plt.pie(df16, labels = df16.index, autopct='%.2f');
#They are both the lowest-order regions. South represents only 16% of orders. Maybe we could increase marketing there?
df15 = df13.groupby("Region")["Sales"].sum().astype(int) #Sales is lower in Central and even lower in South.
plt.subplots()
x = df15.index
y = df15.values
plt.bar(x, y, color="#FFBD5E", label="Sales by Region,\n in dollars")
plt.xticks(x)
plt.yticks([0, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 900000])
for i in range(len(x)):
plt.annotate(df15.values[i], (df15.index[i], df15.values[i]-40000), ha="center")
plt.legend(loc = "upper left")
plt.show()
- The states that ordered the most are California (West) and New York (East);
- and the sales by state follow a very similar order, with California being the state with the biggest revenue, followed by New York, Texas (Central), Washington and Pennsylvania.
df22 = df["State"].value_counts().head(10)
fig, ax = plt.subplots(figsize=(10,6))
x = df22.index
y = df22.values
plt.bar(x, y, color="#53BEFF", label="Orders by State")
plt.xticks(x)
for i in range(len(x)):
plt.annotate(df22.values[i], (df22.index[i], df22.values[i]+20), ha="center")
plt.tick_params('x',labelrotation=70, labelsize="large")
plt.legend()
plt.show()
- Reference for the choropleth map below: https://www.kaggle.com/code/samruddhim/part-1-exploratory-data-analysis
state = ['Alabama', 'Arizona' ,'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida',
'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana','Nebraska', 'Nevada', 'New Hampshire',
'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
'West Virginia', 'Wisconsin','Wyoming']
state_code = ['AL','AZ','AR','CA','CO','CT','DE','FL','GA','HI','ID','IL','IN','IA','KS','KY','LA','ME','MD','MA',
'MI','MN','MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN',
'TX','UT','VT','VA','WA','WV','WI','WY']
state_df = pd.DataFrame(state, state_code) #creating a dataframe of state names indexed by state code
state_df.reset_index(level=0, inplace=True) #moving the state codes into a regular column
state_df.columns = ['State Code','State']
sales = df.groupby("State")[["Sales"]].sum() #creating a dataframe of sales by state (only the numeric column)
sales = sales.sort_index() #alphabetical by state, matching state_df's order
sales.reset_index(inplace = True)
sales["State Code"] = state_df["State Code"]
sales['text'] = sales['State']
fig = go.Figure(data=go.Choropleth(
locations=sales['State Code'], # Spatial coordinates
text=sales['text'],
z = sales['Sales'].astype(float), # Data to be color-coded
locationmode = 'USA-states', # set of locations match entries in "locations"
colorscale = 'balance',
colorbar_title = "Sales",
))
fig.update_layout(
title_text = 'Total Sales by State',
    geo_scope='usa', # limit map scope to USA
)
fig.show();
- New York City is the city with the biggest revenue (more than 10% of the business's revenue) and is also the city that orders the most, in both cases followed by Los Angeles;
- NYC and LA each placed more orders than the fourth state (Pennsylvania);
- and NYC and LA each sold more on their own than the entire state of Texas, the third state in total sales.
*So, they are the most important cities to the business.
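The ">10% of revenue" claim for NYC is a share computation; a minimal sketch with made-up rows (the notebook uses df["City"] and df["Sales"]):

```python
import pandas as pd

# Made-up order rows standing in for the real dataset
sales = pd.DataFrame({
    "City": ["New York City", "Los Angeles", "Houston", "New York City"],
    "Sales": [150.0, 100.0, 50.0, 100.0],
})

by_city = sales.groupby("City")["Sales"].sum()
share = by_city / by_city.sum() * 100  # each city's share of total revenue, in %
print(share.round(1).sort_values(ascending=False))
```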
df23 = df["City"].value_counts().head(10)
fig, ax = plt.subplots(figsize=(10,6))
x = df23.index
y = df23.values
plt.bar(x, y, color="#E19EFC", label="Orders by City")
plt.xticks(x)
for i in range(len(x)):
plt.annotate(df23.values[i], (df23.index[i], df23.values[i]+10), ha="center")
plt.tick_params('x',labelrotation=70, labelsize="large")
plt.legend()
fig, ax = plt.subplots(figsize=(10,6))
df24 = df.groupby(["City"])["Sales"].sum().astype(int).sort_values(ascending=False).head(10)
x = df24.index
y = df24.values
plt.bar(x, y, color="#D877FF", label="Sales by City,\n in dollars")
plt.xticks(x)
for i in range(len(x)):
plt.annotate(df24.values[i], (df24.index[i], df24.values[i]-10000), ha="center")
plt.tick_params('x',labelrotation=70, labelsize="large")
plt.legend()
plt.show()
# Won't explore cities much here, especially over different periods of time.
# Just checking the biggest city revenue and the city that orders the most - NYC on both counts!
* We have Category and Sub-Category data
- Tech is the category with the biggest revenue, even though Office Supplies is the most ordered category;
*That's probably because tech products' sale prices are much higher than office supplies'.
- But there's a certain balance in revenue by Category: no category's sales are wildly out of line with the others.
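The "balance" observation can be quantified as each category's share of total revenue; a sketch with illustrative totals (not the real figures from the chart below):

```python
import pandas as pd

# Illustrative category totals, in thousands of dollars (made-up numbers)
by_cat = pd.Series({"Technology": 830.0, "Office Supplies": 700.0, "Furniture": 730.0})

share = by_cat / by_cat.sum() * 100  # revenue share per category, in %
print(share.round(1))
# "Balance" here means no share dwarfs the others
assert share.max() - share.min() < 10
```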
#We have Category and Sub-Category
df25 = df["Category"].value_counts()
x = df25.index
y = df25.values
plt.bar(x, y, color="#F6B8FF", label="Total Orders by Category")
plt.xticks(x)
for i in range(len(x)):
plt.annotate(df25.values[i], (df25.index[i], df25.values[i]-300), ha="center")
plt.legend()
df26 = df.groupby("Category")["Sales"].sum().astype(int).sort_values(ascending=False)
#Tech is the category with the biggest revenue.
plt.subplots()
x = df26.index
y = df26.values
plt.bar(x, y, color="#ED6EFF", label="Total Sales by Category,\n in dollars")
plt.xticks(x)
plt.yticks([0, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 900000])
for i in range(len(x)):
plt.annotate(df26.values[i], (df26.index[i], df26.values[i]-50000), ha="center")
plt.legend()
plt.show()
"""Even though Office Supplies is the most ordered category.
That's probably because tech products' sale prices are much higher than office supplies'.""";
- The most ordered sub-categories are Binders and Paper (Office Supplies) - nothing calls attention;
df27 = df["Sub-Category"].value_counts() #Most ordered sub-categories are Binders and Paper (Office Supplies).
fig, ax = plt.subplots(figsize=(10,9))
ax.barh(df27.index, df27.values, color='#B9FF6F', edgecolor = "black", label="Total Orders by Sub-Category")
ax.invert_yaxis()
for i in range(len(df27)):
ax.annotate(df27.values[i], (df27.values[i]-40, df27.index[i]), ha="center")
ax.legend(loc=7);
- but the sub-categories with the biggest revenues are Phones (Technology) and Chairs (Furniture).
df28 = df.groupby("Sub-Category")["Sales"].sum().astype(int).sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(10,9))
ax.barh(df28.index, df28.values, color='#CDFF99', edgecolor = "g", label="Total Sales by Sub-Category,\n in dollars")
ax.invert_yaxis()
for i in range(len(df28)):
if df28.values[i] > 300000:
ax.annotate(df28.values[i], (df28.values[i]-14000, df28.index[i]), ha="center")
else:
ax.annotate(df28.values[i], (df28.values[i]+13000, df28.index[i]), ha="center")
ax.legend(loc=7);
#We have Product Name
df29 = df["Product Name"].value_counts().head(20)
#Most ordered products are Staple Envelope, Staples and Easy-staple paper (Office Supplies).
fig, ax = plt.subplots(figsize=(10,9))
ax.barh(df29.index, df29.values, color='#99EBFF', edgecolor = "black", label="Total Orders by Product")
ax.invert_yaxis()
for i in range(len(df29)):
ax.annotate(df29.values[i], (df29.values[i]+1, df29.index[i]), ha="center")
ax.legend(loc=7);
- Even though it is not among the 20 most ordered items, the Canon imageCLASS 2200 Advanced Copier is the first item in revenue, with more than double the second item's revenue;
- none of the 10 items with the biggest revenues are among the 20 most ordered items!
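The non-overlap between top-revenue and most-ordered items is a set comparison; a sketch with hypothetical products (the notebook compares df29's and df30's indices):

```python
import pandas as pd

# Hypothetical order rows: many cheap orders vs. a few expensive ones
orders = pd.DataFrame({
    "Product Name": ["Staples", "Staples", "Staples", "Envelope", "Envelope",
                     "Copier", "Chair"],
    "Sales": [5.0, 5.0, 5.0, 10.0, 10.0, 5000.0, 2000.0],
})

top_ordered = set(orders["Product Name"].value_counts().head(2).index)
top_revenue = set(orders.groupby("Product Name")["Sales"].sum().nlargest(2).index)
overlap = top_ordered & top_revenue  # empty set: no item is in both lists
print(top_ordered, top_revenue, overlap)
```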
df30 = df.groupby(["Product Name"])["Sales"].sum().astype(int).sort_values(ascending=False).head(10)
"""Even though it is not among the 20 most ordered items, the Canon imageCLASS 2200 Advanced Copier is the first item in revenue,
with more than double the second item's revenue.""";
fig, ax = plt.subplots(figsize=(10,9))
ax.barh(df30.index, df30.values, color='#4CDCFF', edgecolor = "b", label="Total Sales by Product,\n in dollars")
ax.invert_yaxis()
for i in range(len(df30)):
ax.annotate(df30.values[i], (df30.values[i]-3000, df30.index[i]), ha="center")
ax.legend(loc=7);
- Every year the best month was November or December, except in 2015 (September).
*We can start to think that there may be a trend of growth in sales after 2016, and a similar seasonality in sales over each year.
- Sales by ship mode follow the same pattern as orders by ship mode - Standard Class is the company's main ship mode, with the biggest revenue (1,340,831 dollars - 59.2% of total sales).
*Max Days to Ship is seven - not bad depending on the situation. But could it be better for First, Second and mainly Standard Class (since it's the company's main ship mode)?
- The main segment for sales is Consumer (1,148,060 dollars - 50.7% of total sales), followed by Corporate (688,494 dollars - 30.4%) and Home Office (424,982 dollars - 18.7%).
- We got 793 different customers.
*Only 6 customers are "single-time buyers" - which is great: our customers tend to buy again after the first time.
- West and East are the best regions in order numbers (32.04% and 28.42%, respectively) and in sales (710,219 and 669,518 dollars, respectively);
- Central and South have the fewest orders and are also the two lowest-sales regions (Central with 492,626 dollars in revenue, and South with 389,151 dollars).
*Maybe we could increase marketing there?
- The states that ordered the most are California (West) and New York (East);
- New York City is the city with the biggest revenue and the most orders, followed by Los Angeles.
*So, they are the most important cities to the business!
- Tech is the category with the biggest revenue, even though Office Supplies is the most ordered category.
*That's probably because tech products' sale prices are much higher than office supplies'.
- But there's a certain balance in revenue by Category: no category's sales are wildly out of line with the others.
*So we can assume that all three categories are very important to the business.
- The most ordered sub-categories are Binders and Paper (Office Supplies);
*all of them have more than double the orders of the fourth item (Avery Non-Sticky Binders - also Office Supplies).